Abstract: Prosper is America’s first peer-to- peer lending marketplace, with more than 2 million members and over $2,000,000,000 in funded loans. Here we use the data available to the public (last updated on March 11th, 2014) from Udacity, which contains all the listings and loans ever created with 81 variables on each loan/listing, to do some data analysis.
What are the dimensions of the dataset?
## [1] 113937 81
Let’s take a look at the sheer size of this dataset:
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
This data set contains 113,937 loans with 81 variables such as loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, Prosper score etc.. and the latest payment information.
The variable ListingCategory shows the reason a loan was taken, such as for debt consolidation or maybe a student loan. Unfortunately, the data is stored numerically, with the key being listed here. It makes much more sense to see the listing as what they are, rather than the arbitrary number it’s been given. For that, I’ll have to change the ListingCategory variable from numeric to a factor:
# Convert ListingCategory from numeric to factor variable using the keys given
# in the Google Spreadsheet.
labels <- c("Debt Consolidation", "Home Improvement", "Business",
"Personal Loan","Student Use", "Auto", "Baby & Adoption", "Boat",
"Cosmetic Procedure","Engagement Ring", "Green Loans",
"Household Expenses", "Large Purchases","Medical/Dental",
"Motorcycle", "RV", "Taxes", "Vacation", "Wedding","Other",
"Not Applicable")
pld$ListingCategory <- factor(pld$ListingCategory..numeric.,
levels = c(1:6, 8:20, 7, 0),labels = labels)
## Debt Consolidation Home Improvement Business
## 58308 7433 7189
## Personal Loan Student Use Auto
## 2395 756 2572
## Baby & Adoption Boat Cosmetic Procedure
## 199 85 91
## Engagement Ring Green Loans Household Expenses
## 217 59 1996
## Large Purchases Medical/Dental Motorcycle
## 876 1522 304
## RV Taxes Vacation
## 52 885 768
## Wedding Other Not Applicable
## 771 10494 16965
Let’s take a look at the summary of the LoanStatus variable to see what grouping is possible for factor plots:
summary(pld$LoanStatus)
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
Too many variables means there will be too many colours to make an effective visualization. There are 6 ‘Past Due’ factors based on how late the borrowers are with their payments and other delinquent customers, it will be much more useful visually if they’re grouped as one. Let’s write the code to do that:
pld <- pld %>% mutate(LoanStatusGroup = ifelse(LoanStatus %in%
c("Cancelled", "Chargedoff", "Defaulted",
"Past Due (1-15 days)","Past Due (16-30 days)",
"Past Due (31-60 days)","Past Due (61-90 days)",
"Past Due (91-120 days)","Past Due (>120 days)"), 0,
ifelse(LoanStatus %in%
c("Current", "FinalPaymentInProgress"),1,2)))
pld$LoanStatusGroup <- factor(pld$LoanStatusGroup, levels = 0:2,
labels = c("Delinquent","Current","Completed"))
Lets create a new variable to indicate the year of Loan Creation
# Create new variable
pld$LoanCreationYear <- format(as.Date(pld$ListingCreationDate),"%Y")
# Convert dates to date class using lubridate's ymd_hms() function
x <- as.character(pld$LoanOriginationDate)
pld$LoanOriginationDate <- ymd_hms(x)
# Convert LoanOriginationQuarter to begin with the year using tidyr
# This also makes sure that any plot axis will be in increasing order of year
pld$LoanOriginationQuarter <- as.character(pld$LoanOriginationQuarter)
pld <- pld %>%
separate (col = LoanOriginationQuarter,
into = c("Quarters", "Year"), sep = " ") %>%
unite(col = LoanOriginationQuarter, Year, Quarters, sep = " ")
pld$LoanOriginationQuarterF <- factor(pld$LoanOriginationQuarter)
Here we perform some preliminary exploration of the dataset by running some summaries of the data and create univariate plots to understand the structure of the individual variables in the dataset
Since I’m dealing with loan data, the first thing I want to check is the loan amounts being requested. Let’s make a simple histogram showing just that:
Summary Stats for LoanOriginalAmount
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
After adjusting the bin width and x-axis breaks, we get a nice histogram showing the loans borrowed ranging from $1000 to the maximum amount of $35,000. The bulk of the count lies below $15,000,which indicates that most of the borrowers using Prosper are looking for small(er) loans.The other thing I noticed was that there are huge spikes in the count on nice, round values, like $10,000, $15,000, and even $20,000 and $25,000. This makes sense; people are likely to fixate on values that are easy to remember, like those that are factors of 5 and 10, rather than arbitrary values in between.
The next question that needs to be asked is Loan Tenure. Let’s answer this question with a histogram
Occupation of Borrowers
Next I want to visualize more data on these borrowers, and there are a variety of variables that can paint a clearer picture. Let’s first see what they do:
I decided to opt for a vertical plot, since there are 68 occupations listed.
The first thing I noticed in this histogram were the two bars a world away from the rest of the occupations - and a quick check told me that it was the count for ‘Other’ and ‘Professional’. It seems like the people signing up to get loans from Prosper are not quite willing to provide their job information, and are opting instead for these two ambiguous options. Most of the professions have pretty low counts, but the diversity is quite interesting. Among all of them, relatively popular ones include analysts, accountants, computer programmers, teachers and executives. Prosper has definitely enabled a wide range of working people to either borrow or invest money.
Purpose of Loan
Now we will see for what purpose borrowers typically take the loans.
Debt Consolidation seems by far the most popular choice, with the rest of the non-ambiguous (or ‘Not Applicable’) occupations much below the 10,000 mark. Acquiring more debt to repay existing debts seems to be the major intent of money borrowers using Prosper. Perhaps they want to structure and streamline their debt payments to one source instead of dealing with multiple loans from varied sources.
Region (State) of Borrowers
Where do the borrowers hail from?
We see that the states of California,Texas, New York are the top 3 states having the most no. of borrowers using Prosper Loans. It is not surprising to see California heading the list, as it was here that the company was first founded.
Next let’s look at the income range of these borrowers Most borrowers seem to be in the income bracket from 25,000 $ to 75,000 $.
EstimatedEffectiveYield - A better measure for a Lender’s Success
Summary Stats for Estimated Effective Yield
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.116 0.162 0.169 0.224 0.320 29084
EstimatedEffectiveYield is said to be better estimate for the lenders than the interest rate because the interest includes processing fees, uncollected interest due to borrower being chargedoff. Plus it also doesn’t include late fines. Hence EstimatedEffectiveYield takes into account all these things and it is thus a better measure. In the figure above we are trying to see the distribution of the EstimatedEffectiveYield and we can see that it is multimodal. We see the most popular EstimatedEffectiveYield is around 0.3 while the mean is around 0.17 represented by the blue dotted line. The multimodal pattern shows that there are multiple EstimatedEffectiveYield that is popular. Strangely we can also see that the some customers have negative EstimatedEffectiveYield. This may mean a lot of things. This may mean that their BorrowerRate is a lot lower than their service fee rate or these customer’s uncollected interest on chargeoff is lot more or they just never payed the late fee and payed back the loans along with the interest always on time.
Distribution for some continuous/categorical features
Lets explore the histogram distribution of other features in this dataset, so we can evaluate them later if they are interesting
We have performed various operations like tidying up the data and adjusting the time series to have quarter yearly durations for future plots. We have investigated several features of the dataset which will be key to further analysis in this EDA.
In this section we investigate relationships between variables, both the main features as well as the supporting variables.
Do Lenders prefer borrowers with better Prosper Score ?
Now lets see what is the distribution of EstimatedEffectiveYield depending on the different ProsperScore. This is important because we want to answer the question: DO LENDERS GET MORE EstimatedEffectiveYield IF THEY HAVE BETTER ProsperScore ?
We are using violin plot instead of box plot for this.
We see some interesting trends here. Here a higher Prosper score means better prospects (with 11 being the best, or lowest risk, score) and lower Prosper score means poor prospects for the lenders (worst, or highest risk, score, is a 1). We can see that for lower ProsperScore, distribution of effective yield is a lot more than the higher ProsperScore. This may mean that lenders charge a variety of interest rate for the borrower with poor prospects as compared to borrowers with better prospect. We can also notice how the median (represented by the black dot)is decreasing as ProsperScore is increasing. This may mean that lenders give more relaxations to borrowers with better ratings as compared to borrowers with poor rating.
Does that mean lenders trust and like borrowers with better ProsperScore?
A higher EstimatedEffectiveYield for lesser ProsperScore borrowers may be due to high late fines because lesser ProsperScore borrowers are more prone to fail to repay their loans in a timely fashion every month. A little more analysis may reveal it more. The reason we need more exploration on this is because EstimatedEffectiveYield includes more things such as late fines and doesn’t include processing fees and others. So, let’s see if borrower’s interest rate shows the same trend for each ProsperScore categories or not because interest rates doesn’t include late fines.
Finally things are revealed much better now! We can clearly observe that for both BorrowerAPR and BorrowerRate which are metrics for interest rates, we see a declining trend as the ProsperScore increases. This justifies even more the fact that lenders somehow prefer to charge less for the borrowers with better ProsperScore as compared to borrowers with inferior ProsperScore.
Listing Category Vs. Loan Amount and Interest Rate
Summary Statistics of Loan Original Amount for each Listing Category
## pld$ListingCategory: Debt Consolidation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 9500 9908 15000 35000
## --------------------------------------------------------
## pld$ListingCategory: Home Improvement
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6000 8092 10250 35000
## --------------------------------------------------------
## pld$ListingCategory: Business
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7279 8927 13500 35000
## --------------------------------------------------------
## pld$ListingCategory: Personal Loan
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1700 3000 4557 5500 25000
## --------------------------------------------------------
## pld$ListingCategory: Student Use
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 1500 2600 3515 4988 22000
## --------------------------------------------------------
## pld$ListingCategory: Auto
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4000 5001 6000 35000
## --------------------------------------------------------
## pld$ListingCategory: Baby & Adoption
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 9000 9751 15000 30000
## --------------------------------------------------------
## pld$ListingCategory: Boat
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 7000 8734 12000 35000
## --------------------------------------------------------
## pld$ListingCategory: Cosmetic Procedure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 3550 4000 5684 7512 15000
## --------------------------------------------------------
## pld$ListingCategory: Engagement Ring
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 6500 7637 10000 27000
## --------------------------------------------------------
## pld$ListingCategory: Green Loans
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 5000 8457 13750 25000
## --------------------------------------------------------
## pld$ListingCategory: Household Expenses
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 3000 4000 5285 6000 35000
## --------------------------------------------------------
## pld$ListingCategory: Large Purchases
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 6500 8772 12000 35000
## --------------------------------------------------------
## pld$ListingCategory: Medical/Dental
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 3202 4075 6524 9000 25000
## --------------------------------------------------------
## pld$ListingCategory: Motorcycle
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 3500 4000 5561 6500 35000
## --------------------------------------------------------
## pld$ListingCategory: RV
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2500 4000 6000 8149 11000 25000
## --------------------------------------------------------
## pld$ListingCategory: Taxes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 5000 7580 10000 35000
## --------------------------------------------------------
## pld$ListingCategory: Vacation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 3000 4000 5358 6000 35000
## --------------------------------------------------------
## pld$ListingCategory: Wedding
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2000 4000 7500 8836 13000 35000
## --------------------------------------------------------
## pld$ListingCategory: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4000 5912 7500 35000
## --------------------------------------------------------
## pld$ListingCategory: Not Applicable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2550 4500 6254 8000 25000
In the above two plots, I examine the loan category, loan amounts and borrowing rate. Debt consolidation as expected has the highest loan amounts across categories; however, unexpectedly Baby and Adoption category is also the highest (tied) loan amount across categories. It would be interesting to investigate the total number of borrowers across these two categories, one would expect debt consolidation to have a large distribution of borrowers.
Also of note,the Student Use category loan amount it quite low. Given the well documented high levels of student debt, we can assume students didn’t suddenly stop having debt but instead didn’t use Prosper to manage their debt or this category covers items other than student tuition such as lunches,dinners or books.
The borrower’s interest rate doesn’t seem to have any particular surprises. The mean borrowing rate across all categories tend to vary between 0.15 and 0.25.
To identify significant relationships between the different variables in the dataset lets go for a correlation matrix.
Let’s select only the variables of interest.
colsToKeep <- c("Term","EstimatedLoss","EstimatedEffectiveYield","LenderYield",
"CreditScoreRangeLower","CreditScoreRangeUpper","MonthlyLoanPayment",
"LoanOriginalAmount","TotalProsperLoans","Investors","BorrowerAPR",
"BorrowerRate","DebtToIncomeRatio","StatedMonthlyIncome")
# Keep only variables in colsToKeep (columns to keep)
prosperloans <- subset(pld, select=colsToKeep)
A correlation matrix will be used to calculate the Pearson coefficients of the various features in the dataset in order to help start the investigation process where by the variables with strong and/or weak relationships will be reviewed.
Correlation Matrix
There are lots of correlated variables in the data set. The most correlated variables (corr = 0.911) are Loan origination amount and monthly loan payment, which is not surprising.
Borrower rate is negatively correlated with Prosper score and credit score. Debt income ratio is slighlty correlated with credit score.
Prosper score is also correlated with loan origination amount & monthly payment.
In this section, we will further examine the relationships between multiple variables of the dataset. We will continue with the strong relationships identified from the correlation matrix in the Bivariate Plots section and also look at some other variables that may be of interest.
Distribution of Loans over the Years
In this plot we see basically see the growth story of Prosper over the years since its inception. This seems to be a very interesting plot that is worthy of further investigation.
Debt-to-Income Ratio and Monthly Income
In comparing debt-to-income ratio with a borrower’s stated monthly income one might expect to see a somewhat obvious, trend that delinquent borrowers would have a lower monthly income and a higher debt-to-income ratio. With overplotting in the scatterplot two additional techniques ( facet grid and geomdensity2d) were used to more clearly reveal any trends or unexpected results.Borrowers with stated incomes over $20k and debt-to-income ratios over 1 were considered as outliers and removed from the plot.
The density contour lines show a high concentration of delinquent borrowers earn less than $3500 a month but have a low debt-to-income ratio of under 0.50. The plot also suggests a negative correlation between monthly income and debt-to-income ratio,i.e the more a borrower makes in monthly income the lower their debt-to-income ratio; However, this does not guarantee the loan will not go into delinquency.
After further review from online resources, it seems that a typical “good” debt-to-income ratio is under approx. 36%.
Estimated Effective Yield by Estimated Loss The correlation matrix showed a high positive correlation (0.9) between Estimated Effective Yield and Estimated Loss.Investigating this relationship further constrained on LoanTerm and ProsperScore reveals some interesting insights.The short term 1 year loan, seems to be having a exponentially increasing Estimated loss in comparison to the other term loans. The widely popular 3 year loan shows Estimated Losses all over the place.
The plot on the right factored by Prosper Score indicates a logical and expected outcome, lower Prosper scores (close to 1) indicating high risks show a higher Estimated Loss and for high Prosper Scores indicating very low risks the Estimated Loss is very low.
Monthly payment and loan amount There are three clusters of points: one is the 1 year term loan, another is the 3 year term period loan, and the last cluster is the 5 year term period loan. A linear regression line has also been added to the plot, which we will explore at the end of this section on Multivariate Plots.
We can try to explain the variance.
We observe the variance explained by risk, represented by the ProsperScore. The bottom of the scatter plot is dominated by loans with a ProsoperScore equal to 11, which represents loans with low risks. The top of the scatter plot is dominated by loans with a ProsoperScore equal to 4, which represents loans with higher risks. Loans with a loan amount higher than $25000 are mostly dominated by a ProsperScore equal or superior to 1.
We looked at Prosper’s growth story over the years since it’s inception in 2005, investigated the relationship between Debt to Income Ratio Vs. Monthly income based on Loan Status which did not reveal anything conclusive because of overplotting.
We have observed that the Prosper score has an influence on montly payments for the same loan origination amount and same term. This Score should be a way to quantify the probability that the loan is going to default or not. The Prosper Score also revealed the quantum of Estimated Loss that is likely to be incurred by the lender. However when we look at the Prosper Score alone it doesn’t seem be a great indicator for the investor if he should invest or not, for the features considered in the Multivariate Analysis.
Linear Mixed Effects Model of Monthly Loan Payment vs. Loan Original Amount:
##
## Call:
## lm(formula = MonthlyLoanPayment ~ LoanOriginalAmount, data = pld)
##
## Residuals:
## Min 1Q Median 3Q Max
## -751.60 -23.45 -1.04 25.94 1499.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.275e+01 3.452e-01 94.9 <2e-16 ***
## LoanOriginalAmount 2.875e-02 3.313e-05 867.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.85 on 113935 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.8686
## F-statistic: 7.531e+05 on 1 and 113935 DF, p-value: < 2.2e-16
LoanOriginalAmount should be the most important factor to decide what would be the MonthlyLoanPayment.As we can see that the correlation has very good Adjusted R-squared value of 0.868 or almost 86%. The Residual Standard Error is however quite high almost 70 which indicates the average performance of the linear model.
In this section we select and refine three plots from all the above explorations and present here a summary of some of the most interesting findings.
It is very worthwhile to investigate the creation date distribution of loans in the dataset. This is important as this could cause bias in further investigation, for example, if the majority of the data was observed during 2008-2009 (the financial crisis) this could skew the data towards having a majority of delinquency.
In the above plot we further investigate the plot we made in the Bivariate section. We have expanded the X-axis to include Yearly quarters instead of just showing the year.
Here we see some dampening in the business of Prosper Funding LLC which can be attributed to the financial crisis of 2008. There was a lot of delinquent (bad performing) loans. Post 2009, the company has restructured its business models and policies and has been growing nicely since then.
Here we select the plot of Loan Origination Amount Vs. Monthly Loan Payment from the Multivariate Plots section but we restrict the data for 3 year Term loans, as it is the most popular term for borrowers.
Do lenders prefer borrowers with better ProsperScore ?
This plots shows that interest rates are affected by the Prosper Score for varying risk factors. As the Prosper score improves, the interest rate shows a declining trend visible from the black points. ——
The prospser loans dataset contains over 100k observations with 81 variables spanning across 9 years. Understanding the variables, terminology and general domain knowledge of financial peer-to-peer lending was the first obstacle in approaching this dataset.But the overall experience has been quite rewarding in this EDA. I had learned many tools and concepts in Data Analytics with R in this process.
We have been able to analyze that Prosper’s portfolio has really changed between 2009 and 2014.It seems that Prosper doesn’t only use credit scores to assess risk but also several other variables that could be in the dataset, in addition to external data not included in the dataset available.
Instead of just using the Prosper score to decide whether or not to invest, it would be interesting to build a predictive model that quantifies if the person is going to default on their loan. We could use for example, a logistic regression. However there are lots of variables that are correlated,such as loan origination amount, credit score, and Prosper score.It’s important to carefully choose which variables should be included in the model.
Additional data would also enhance this dataset. Having the borrower’s age and sex would allow analysis to possibly discover trends among men and women or young and old. Also, population and state-average-income features, would allow for discovery of the type of environment the borrower lived in.